DAnIEL: Language Independent Character-Based News Surveillance
Abstract
This study aims to develop a news surveillance system able to handle multilingual web corpora. As an example of a domain where multilingual capacity is crucial, we focus on epidemic surveillance. This task requires worldwide news coverage in order to detect new events as quickly as possible, anywhere, whatever the language in which they are first reported. In this study, text genre is used rather than sentence analysis. The properties of the news genre allow us to assess the thematic relevance of news, filtered with the help of a specialised lexicon that is automatically collected from Wikipedia. Afterwards, a more detailed analysis of text-specific properties is applied to relevant documents to better characterize the epidemic event (i.e., which disease spreads where?). Results from 400 documents in each language demonstrate the value of this multilingual approach with light resources. DAnIEL achieves an F1-measure of around 85%. Two issues are addressed: the first is morphologically rich languages, e.g. Greek, Polish and Russian, as compared to English. The second is event location detection as related to disease detection. This system provides a reliable alternative to the generic IE architecture, which is constrained by the lack of numerous components in many languages.
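As a rough illustration of why character-level matching against a lexicon can cope with inflected forms, here is a minimal Python sketch. It is not the authors' implementation: the function names, the longest-common-substring criterion, the `min_ratio` threshold and the two-word lexicon are all assumptions made for this example.

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest substring shared by a and b (dynamic programming)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common substring that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def matching_terms(text: str, lexicon: list[str], min_ratio: float = 0.75) -> list[str]:
    """Return lexicon terms whose characters largely appear as one run in text.

    min_ratio is a hypothetical threshold: the fraction of a term's characters
    that must be found contiguously in the text for the term to count as a hit.
    """
    lowered = text.lower()
    hits = []
    for term in lexicon:
        t = term.lower()
        if t and longest_common_substring(t, lowered) / len(t) >= min_ratio:
            hits.append(term)
    return hits


# Hypothetical lexicon of disease names (Polish "grypa" = influenza).
lexicon = ["grypa", "cholera"]
headline = "Wzrost zachorowań na grypę w Polsce"
print(matching_terms(headline, lexicon))
```

In this example the inflected form "grypę" shares the four-character run "gryp" with the lexicon entry "grypa", so the entry is matched without any morphological analyser, while "cholera" shares only three characters with the headline and is rejected.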
Similar resources
Multilingual event extraction for epidemic detection
OBJECTIVE This paper presents a multilingual news surveillance system applied to tele-epidemiology. It has been shown that multilingual approaches improve timeliness in detection of epidemic events across the globe, eliminating the wait for local news to be translated into major languages. We present here a system to extract epidemic events in potentially any language, provided a Wikipedia seed...
A Language-Independent Transliteration Schema Using Character Aligned Models at NEWS 2009
In this paper we present a statistical transliteration technique that is language independent. This technique uses statistical alignment models and Conditional Random Fields (CRF). Statistical alignment models maximize the probability of the observed (source, target) word pairs using the expectation maximization algorithm and then the character level alignments are set to maximum posterior pre...
2 Toshio
The automatic extraction and recognition of news captions and annotations can be of great help locating topics of interest in digital news video libraries. To achieve this goal, we present a technique, called Video OCR (Optical Character Reader), which detects, extracts, and reads text areas in digital video data. In this paper, we address problems, describe the method by which Video OCR operat...
An XML Based Interactive Multimedia News System
We present a prototype of an interactive multimedia news system featuring a talking virtual character to present the news on the Web. Talking virtual characters are graphical simulations of real or imaginary persons capable of human-like behavior, most importantly talking and gesturing. In our system a virtual character is used as a newscaster, reading the news on the Web while at the same time...
Recognition of Superimposed Caption
The automatic extraction and reading of news captions and annotations can be of great help locating topics of interest in digital news video archives. To achieve this goal, we present a technique, called Video OCR, which detects, extracts, and reads text areas in digital video data. In this paper, we address problems, describe the method by which Video OCR operates, and suggest applications for...
Publication date: 2012